
    Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements

    Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early stages and whether beam experiments can be used to identify the weakest resources in a device. The importance and challenges associated with a timely, yet realistic, reliability evaluation grow with the increase of complexity in both the hardware domain, with the integration of different types of cores in an SoC (System-on-Chip), and the software domain, with the OS (operating system) required to take full advantage of the available resources. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models). We target a standalone Arm Cortex-A5 CPU and an Arm Cortex-A9 CPU integrated into an SoC and evaluate their reliability in bare-metal and Linux-based configurations. Combining experimental data that covers more than 18 million years of device time with the results of more than 176,000 injections, we find that both the SoC integration and the presence of the OS increase the system DUE (Detected Unrecoverable Error) rate (for different reasons) but do not significantly impact the SDC (Silent Data Corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon microarchitecture-level fault injection delivers accurate SDC rate estimations and lower bounds for the DUE rates.
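
    As a rough illustration of how early fault-injection results can be turned into a chip-level estimate comparable with beam measurements, the sketch below scales an SDC AVF (Architectural Vulnerability Factor) obtained from injections into a FIT estimate. All function names and numerical values other than the 176,000-injection count are hypothetical placeholders, not data from the paper.

        # Minimal sketch (Python): combine a fault-injection-derived AVF with a raw
        # per-bit upset rate to estimate an SDC FIT, then compare with a beam value.
        # All numbers except the injection count are illustrative placeholders.

        def sdc_avf(injections: int, sdc_outcomes: int) -> float:
            """Fraction of injected faults that ended up as a silent data corruption."""
            return sdc_outcomes / injections

        def estimated_sdc_fit(raw_bit_fit: float, n_bits: int, avf: float) -> float:
            """FIT (failures per 1e9 device-hours) = per-bit upset rate * bits * AVF."""
            return raw_bit_fit * n_bits * avf

        avf = sdc_avf(injections=176_000, sdc_outcomes=3_200)   # hypothetical SDC count
        predicted = estimated_sdc_fit(raw_bit_fit=1e-4, n_bits=64_000, avf=avf)
        measured_beam_fit = 0.12                                # hypothetical beam-derived FIT
        print(f"AVF={avf:.3f}, predicted SDC FIT={predicted:.2f}, beam SDC FIT={measured_beam_fit:.2f}")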

    High Energy and Thermal Neutrons Sensitivity of Google Tensor Processing Units

    In this article, we investigate the reliability of Google's Coral tensor processing units (TPUs) to both high-energy atmospheric neutrons (at ChipIR) and thermal neutrons from a pulsed source [at the Equipment Materials and Mechanics Analyzer (EMMA)] and from a reactor [at the Thermal and Epithermal Neutron Irradiation Station (TENIS)]. We report data obtained with an overall fluence of 3.41×10¹² n/cm² for atmospheric neutrons (equivalent to more than 30 million years of natural irradiation) and of 7.55×10¹² n/cm² for thermal neutrons. We evaluate the behavior of TPUs executing elementary operations with increasing input sizes (standard convolutions or depthwise convolutions) as well as eight convolutional neural network (CNN) configurations (single-shot multibox detection (SSD) MobileNet v2 and SSD MobileDet, trained on the COCO dataset, and Inception v4 and ResNet-50, trained on the ILSVRC2012 dataset). We found that, despite the high error rate, most neutron-induced errors only slightly modify the convolution output and do not change the detection or classification of the CNNs. By reporting details about the error model, we provide valuable information on how to design CNNs so that neutron-induced events do not lead to misdetections or misclassifications.
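
    For readers unfamiliar with how beam data of this kind are usually reduced, the sketch below shows the standard cross-section and FIT computation (sigma = errors/fluence, then scaling by a reference natural flux). The fluence value comes from the abstract; the error count and the reference flux are assumptions for illustration only.

        # Minimal sketch: derive a neutron cross section and a FIT rate from beam data.
        # Only the fluence below is taken from the abstract; the error count is a
        # placeholder and the reference flux is the commonly cited ~13 n/(cm^2*h)
        # sea-level value (an assumption; check the relevant standard for your case).

        REFERENCE_FLUX = 13.0       # high-energy neutrons per cm^2 per hour, sea level

        def cross_section(errors: int, fluence_n_cm2: float) -> float:
            """Cross section in cm^2: observed errors per unit fluence."""
            return errors / fluence_n_cm2

        def fit_rate(sigma_cm2: float, flux: float = REFERENCE_FLUX) -> float:
            """Expected failures per 1e9 device-hours of natural exposure."""
            return sigma_cm2 * flux * 1e9

        sigma = cross_section(errors=250, fluence_n_cm2=3.41e12)    # 250 is hypothetical
        print(f"sigma = {sigma:.2e} cm^2, FIT = {fit_rate(sigma):.2f}")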

    Evaluating Architectural, Redundancy, and Implementation Strategies for Radiation Hardening of FinFET Integrated Circuits

    In this article, the authors explore radiation hardening techniques through the design of a test chip implemented in 16-nm FinFET technology, along with an architectural and redundancy design space exploration of its modules. Nine variants of matrix multiplication were taped out and irradiated with neutrons. The results obtained from the neutron campaign revealed that the radiation-hardened variants present superior resiliency when either local or global triple modular redundancy (TMR) schemes are employed. Furthermore, simulation-based fault injection was utilized to validate the measurements and to explore the effects of different implementation strategies on failure rates. We further show that the interplay between these different implementation strategies is not trivial to capture and that synthesis optimizations can effectively break assumptions about the effectiveness of redundancy schemes.
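
    As a reminder of the TMR idea the variants build on, the sketch below shows a software analogue: three replicas of a matrix-multiplication module and a majority voter that masks a single faulty copy. It is a conceptual illustration, not the authors' 16-nm RTL design.

        # Minimal sketch of triple modular redundancy (TMR): three replicas compute the
        # same result and a majority vote masks a single corrupted replica.
        import numpy as np

        def matmul_replica(a: np.ndarray, b: np.ndarray) -> np.ndarray:
            # Stand-in for one copy of the matrix-multiplication module
            return a @ b

        def tmr_vote(r0: np.ndarray, r1: np.ndarray, r2: np.ndarray) -> np.ndarray:
            """Return the value at least two replicas agree on."""
            if np.array_equal(r0, r1) or np.array_equal(r0, r2):
                return r0
            if np.array_equal(r1, r2):
                return r1
            raise RuntimeError("no majority: more than one replica corrupted")

        a, b = np.random.rand(4, 4), np.random.rand(4, 4)
        results = [matmul_replica(a, b) for _ in range(3)]
        results[1][0, 0] += 1.0                         # emulate an upset in one replica
        print(np.allclose(tmr_vote(*results), a @ b))   # True: the fault is masked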

    Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

    In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures. This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, the EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.
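
    A minimal sketch of two of the metrics mentioned above, the count of corrupted output elements and the dataset-wise mean relative error, is given below; the arrays are illustrative and the function names are not taken from the paper's artifacts.

        # Minimal sketch: compare a corrupted output against the fault-free ("golden")
        # reference, counting corrupted elements and computing a mean relative error.
        import numpy as np

        def corrupted_elements(output: np.ndarray, golden: np.ndarray, tol: float = 0.0) -> np.ndarray:
            """Indices of elements differing from the golden copy by more than tol."""
            return np.argwhere(np.abs(output - golden) > tol)

        def mean_relative_error(output: np.ndarray, golden: np.ndarray) -> float:
            """Dataset-wise mean relative error, ignoring zero golden entries."""
            mask = golden != 0
            return float(np.mean(np.abs((output[mask] - golden[mask]) / golden[mask])))

        golden = np.ones((8, 8))
        output = golden.copy()
        output[2, 3] = 1.5                                   # one corrupted element
        print(len(corrupted_elements(output, golden)),       # 1
              mean_relative_error(output, golden))           # ~0.0078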

    Understanding the Impact of Cutting in Quantum Circuits Reliability to Transient Faults

    Quantum Computing is a highly promising new computation paradigm. Unfortunately, quantum bits (qubits) are extremely fragile, and their state can be gradually or suddenly modified by intrinsic noise or external perturbation. In this paper, we target the sensitivity of quantum circuits to radiation-induced transient faults. We consider quantum circuit cuts that split the circuit into smaller independent portions, and study how faults propagate in each portion. As we show, the cuts have different vulnerabilities, and our methodology successfully identifies the circuit portion that is most likely to contribute to the overall circuit error rate. Our evaluation shows that a circuit cut, when corrupted, can have a 4.6x higher probability than the other cuts of modifying the circuit output. Our study, by identifying the most critical cuts, moves towards the possibility of implementing selective hardening for quantum circuits.
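
    To make the cut-level comparison concrete, the toy bookkeeping below ranks cuts by the probability that a fault inside them changes the final output; the probabilities are placeholders loosely inspired by the 4.6x figure above, not the paper's data.

        # Minimal sketch: rank circuit cuts by how likely a fault inside them is to
        # change the overall output, to decide where selective hardening pays off.
        cut_output_change_prob = {          # hypothetical per-cut probabilities
            "cut_0": 0.046,                 # ~4.6x the others
            "cut_1": 0.010,
            "cut_2": 0.010,
        }
        most_critical, p = max(cut_output_change_prob.items(), key=lambda kv: kv[1])
        print(f"harden {most_critical} first (P[output change | fault] = {p:.3f})")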

    Performance-Reliability Trade-Off in Graphics Processing Units

    We show that most performance improvements in GPUs increase the number of executions correctly completed before experiencing a failure. We consider four different performance improvements: architectural solutions, software implementations, compiler optimizations, and degree of parallelism.
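
    The metric implied here is often expressed as the mean number of executions completed between failures: a faster implementation exposes the device for less time per run, so it can complete more correct work even with a slightly higher per-time error rate. The sketch below illustrates the arithmetic with hypothetical numbers.

        # Minimal sketch: mean executions between failures (MEBF) from execution time
        # and a FIT rate. All numbers are illustrative.
        def mean_executions_between_failures(exec_time_s: float, fit: float) -> float:
            mtbf_hours = 1e9 / fit                       # FIT = failures per 1e9 device-hours
            executions_per_hour = 3600.0 / exec_time_s
            return mtbf_hours * executions_per_hour

        baseline  = mean_executions_between_failures(exec_time_s=2.0, fit=100.0)
        optimized = mean_executions_between_failures(exec_time_s=0.5, fit=120.0)
        print(optimized > baseline)   # True: a 4x speedup outweighs a 20% higher error rate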

    Understanding the Effect of Transpilation in the Reliability of Quantum Circuits

    Transpiling is a necessary step to map a logical quantum algorithm onto a circuit executed on a physical quantum machine, according to the available gate set and connectivity topology. Different transpiling approaches try to minimize the parameters that are most critical for current transmon technology, such as depth and CNOT count. Crucially, these approaches do not take into account the reliability of the circuit. In particular, transpilation can modify how radiation-induced transient faults propagate. In this paper, we aim to advance the understanding of the impact of transpilation on fault propagation by investigating the low-level reliability of several transpiling approaches. We considered 4 quantum algorithms transpiled for 2 different architectures, with increasing numbers of qubits and all possible logical-to-physical qubit mappings, for a total of 4,640 transpiled circuits. We injected a total of 202,124 faults and tracked their propagation. Our experiments show that, by simply choosing the proper transpilation, the reliability of the circuit can improve by up to 14%.
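
    To see how transpilation choices change the circuit that is actually executed, the short Qiskit sketch below transpiles the same toy circuit at two optimization levels onto an assumed linear coupling map and reports depth and CNOT count; the fault-propagation analysis itself is the paper's contribution and is not reproduced here.

        # Minimal sketch (Qiskit): the same logical circuit transpiled with different
        # settings yields different depth and CNOT counts. The coupling map and the
        # circuit are toy examples.
        from qiskit import QuantumCircuit, transpile

        qc = QuantumCircuit(3)
        qc.h(0)
        qc.cx(0, 1)
        qc.cx(1, 2)
        qc.measure_all()

        linear_coupling = [[0, 1], [1, 2]]          # assumed linear connectivity
        for level in (0, 3):
            tqc = transpile(qc,
                            basis_gates=["cx", "rz", "sx", "x"],
                            coupling_map=linear_coupling,
                            optimization_level=level,
                            seed_transpiler=0)
            print(f"optimization_level={level}: depth={tqc.depth()}, "
                  f"cx={tqc.count_ops().get('cx', 0)}")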

    Protecting GPU's Microarchitectural Vulnerabilities via Effective Selective Hardening

    Graphics Processing Units (GPUs) are today adopted in several domains for which reliability is fundamental, such as self-driving cars and autonomous machines. Unfortunately, on the one hand, GPUs have been shown to have a high error rate and, on the other hand, the constraints imposed by real-time safety-critical applications make traditional, costly, replication-based hardening solutions inadequate. This paper proposes an effective microarchitectural selective hardening of GPU modules to mitigate faults that affect the correct execution of instructions. We first characterize, through Register-Transfer Level (RTL) fault injections, the architectural vulnerabilities of a GPU model (FlexGripPlus). We specifically target transient faults in the functional units and pipeline registers of a GPU core. Then, we apply selective hardening by triplicating the locations in each module that we found to be most critical. The results show that selective hardening using Triple Modular Redundancy (TMR) can correct from 85% to 99% of faults in the pipeline registers and from 50% to 100% of faults in the functional units. The proposed selective TMR strategy reduces the hardware overhead by up to 65% when compared with traditional TMR.
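
    The selection step behind such selective TMR can be sketched as a simple ranking problem: protect the locations responsible for the largest share of output-corrupting faults until a budget is exhausted. The per-location vulnerability numbers below are placeholders, not FlexGripPlus measurements.

        # Minimal sketch: pick the most vulnerable locations to triplicate and estimate
        # the resulting fault coverage and extra area. Vulnerability shares are
        # hypothetical and assumed to sum to 1.
        def selective_tmr_plan(vulnerability: dict, budget_fraction: float):
            ranked = sorted(vulnerability.items(), key=lambda kv: kv[1], reverse=True)
            n_protected = max(1, int(len(ranked) * budget_fraction))
            protected = ranked[:n_protected]
            coverage = sum(share for _, share in protected)
            extra_copies = 2 * n_protected / len(ranked)   # two extra copies per protected location
            return [name for name, _ in protected], coverage, extra_copies

        vuln = {"pipe_reg_EX": 0.40, "pipe_reg_MEM": 0.30, "fu_fadd": 0.15,
                "fu_fmul": 0.10, "pipe_reg_WB": 0.05}
        print(selective_tmr_plan(vuln, budget_fraction=0.4))
        # e.g. protects the two pipeline registers: ~70% coverage for 0.8x extra copies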

    Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

    The complexity of both hardware and software makes GPU reliability evaluation extremely challenging. A low-level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access resources that are critical for GPUs and typically uses synthetic fault models (e.g., single bit flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden from the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, fault propagation to be tracked from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48% (18% on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease the reproducibility of results.
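
    The two-level strategy can be pictured as building, offline, a pool of fault effects observed at RTL per instruction opcode, and then sampling from that pool inside a software-level injector. The sketch below illustrates that flow with made-up opcodes and effects; it is not the released fault models or the NVBitFI integration.

        # Minimal sketch: an RTL-derived pool of fault effects per opcode, sampled at
        # software level instead of injecting a single random bit flip. Opcodes and
        # effects are illustrative only.
        import random

        rtl_effect_pool = {
            "FADD": [lambda v: v,                  # masked at the instruction output
                     lambda v: -v,                 # sign corruption
                     lambda v: float("nan")],      # invalid result
            "IMUL": [lambda v: v ^ (1 << 3),       # single-bit pattern (integers only)
                     lambda v: 0],                 # zeroed output
        }
        rng = random.Random(0)

        def inject(opcode: str, correct_output):
            """Replace an instruction's correct output with one sampled fault effect."""
            return rng.choice(rtl_effect_pool[opcode])(correct_output)

        print(inject("FADD", 1.5), inject("IMUL", 12))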

    Transient-fault-aware design and training to enhance DNNs reliability with zero-overhead

    Deep Neural Networks (DNNs) enable a wide series of technological advancements, ranging from clinical imaging to predictive industrial maintenance and autonomous driving. However, recent findings indicate that transient hardware faults may dramatically corrupt the model's predictions. For instance, the radiation-induced misprediction probability can be high enough to impede a safe deployment of DNN models at scale, urging the need for efficient and effective hardening solutions. In this work, we propose to tackle the reliability issue both at training and at model design time. First, we show that vanilla models are highly affected by transient faults, which can induce a performance drop of up to 37%. Hence, we provide three zero-overhead solutions, based on DNN re-design and re-training, that can improve DNN reliability to transient faults by up to one order of magnitude. We complement our work with extensive ablation studies to quantify the performance gain of each hardening component.
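
    As one example of what a zero-overhead re-design can look like, the sketch below bounds activation values so that a corrupted weight or feature cannot produce an arbitrarily large activation; this is a well-known technique offered as an illustration, not necessarily one of the three solutions proposed in the paper, and the bound is a hypothetical calibration value.

        # Minimal sketch (PyTorch): a clipped ReLU that caps activations at a bound
        # calibrated on fault-free data, limiting how far a transient fault can push
        # the model's intermediate values. The bound 6.0 is a placeholder.
        import torch
        import torch.nn as nn

        class ClippedReLU(nn.Module):
            def __init__(self, bound: float):
                super().__init__()
                self.bound = bound

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return torch.clamp(x, min=0.0, max=self.bound)

        act = ClippedReLU(bound=6.0)                 # drop-in replacement for nn.ReLU()
        faulty = torch.tensor([0.5, 2.0, 1e30])      # 1e30 emulates a corrupted activation
        print(act(faulty))                           # tensor([0.5000, 2.0000, 6.0000])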